This data set contains the visits, orders, and revenue from dotcom, broken down by whether they happened on a large- or small-screen device. In this example we will load the data, run some diagnostics on it, plot it, and build forecasts.
#### data directory
setwd("/Users/a149174/UST_GPS/seis_763/r/seis_763_machine_learning/time_series")
#### Load libraries and data
library(plyr)      # d_ply(); load before dplyr to avoid masking issues
library(dplyr)
library(tidyr)     # spread(), gather()
library(stringr)   # str_detect()
library(ggplot2)
library(scales)    # date_breaks(), comma
library(forecast)
datl <- read.csv("dat-all-long.csv", stringsAsFactors=FALSE) %>%
mutate(dt = as.Date(dt)) %>%
tbl_df()
head(datl)
## Source: local data frame [6 x 3]
##
## dt k v
## 1 2013-03-31 lg.visits 1725827
## 2 2013-04-01 lg.visits 1930250
## 3 2013-04-02 lg.visits 1724775
## 4 2013-04-03 lg.visits 1665799
## 5 2013-04-04 lg.visits 1730834
## 6 2013-04-05 lg.visits 1894947
class(datl)
## [1] "tbl_df" "tbl" "data.frame"
It’s important to have a good idea of what is in your actual data set. To that end, you should run some reasonable diagnostics so that you are sure you have what you are expecting. We do that below:
#### What does the top of the data look like?
datl %>% print()
## Source: local data frame [5,796 x 3]
##
## dt k v
## 1 2013-03-31 lg.visits 1725827
## 2 2013-04-01 lg.visits 1930250
## 3 2013-04-02 lg.visits 1724775
## 4 2013-04-03 lg.visits 1665799
## 5 2013-04-04 lg.visits 1730834
## 6 2013-04-05 lg.visits 1894947
## 7 2013-04-06 lg.visits 1553546
## 8 2013-04-07 lg.visits 1690491
## 9 2013-04-08 lg.visits 1675313
## 10 2013-04-09 lg.visits 1572335
## .. ... ... ...
#### What does the end of your dataset look like?
datl %>% tail()
## Source: local data frame [6 x 3]
##
## dt k v
## 1 2015-11-16 sm.revenue 1491690
## 2 2015-11-17 sm.revenue 1536976
## 3 2015-11-18 sm.revenue 1448862
## 4 2015-11-19 sm.revenue 2923812
## 5 2015-11-20 sm.revenue 2623574
## 6 2015-11-21 sm.revenue 2867591
####
#### Date (dt) diagnostics
####
### Beginning/end of date range
summary(datl$dt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2013-03-31" "2013-11-27" "2014-07-26" "2014-07-26" "2015-03-25" "2015-11-21"
### How many unique date values do I have in the dataset?
datl$dt %>% unique() %>% length()
## [1] 966
### I expect that the dates are contiguous between the min and max value. Are they?
### The statement below returns any dates between the min and max that are missing from the data.
### If nothing is returned (numeric(0)), all dates are present.
setdiff(
seq.Date(from=min(datl$dt), to=max(datl$dt), by=1) # Generate a sequence of dates between beginning and end of data.
, datl$dt %>% unique()) # Compare to dates in data
## numeric(0)
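As a sanity check of the check itself, here is a toy illustration (hypothetical dates, base R only) showing that a removed date is flagged; note that setdiff() drops the Date class, so the result needs converting back.

```r
# Toy illustration of the contiguity check: drop one date and confirm it is flagged.
dts <- seq.Date(as.Date("2013-03-31"), as.Date("2013-04-09"), by = 1)
dts_gappy <- dts[dts != as.Date("2013-04-05")]        # remove one day
missing <- setdiff(seq.Date(min(dts_gappy), max(dts_gappy), by = 1), dts_gappy)
as.Date(missing, origin = "1970-01-01")               # setdiff() returns numeric; convert back
```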
### For each date, I expect the same number of rows, or more specifically the same number of unique key (k) values.
### This code counts how many distinct k values each date has. The result:
### 6
### 966
### means that 966 distinct dates (all of them) each had 6 unique values of k.
datl %>%
group_by(dt) %>%
summarise(n_k=n_distinct(k)) %>%
ungroup() %>%
{table(.$n_k)}
##
## 6
## 966
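The same per-date key count can be done in base R with tapply(); a toy long frame with hypothetical values keeps it self-contained:

```r
# Count distinct k values per date on a toy long data frame.
toy <- data.frame(
  dt = rep(as.Date("2013-03-31") + 0:2, each = 2),
  k  = rep(c("lg.visits", "sm.visits"), times = 3)
)
n_k <- tapply(toy$k, toy$dt, function(x) length(unique(x)))
table(n_k)  # every date should show the same count (here, 2)
```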
####
#### key (k) values
####
### What are the unique keys in column k?
datl$k %>% unique()
## [1] "lg.visits" "lg.orders" "lg.revenue" "sm.visits" "sm.orders" "sm.revenue"
####
#### Do we have any rows that have missing data?
####
### It is nice to know if any data has missing values in the cell. Here is a quick way to check:
### If all cells have data, this will return no rows. It will return any row that has a missing value in
### any of the columns.
datl[!complete.cases(datl),]
## Source: local data frame [0 x 3]
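To see what a hit looks like, here is the same check on a toy frame with one NA injected (hypothetical data):

```r
# complete.cases() flags rows with any missing value.
toy <- data.frame(dt = as.Date("2013-03-31") + 0:2,
                  v  = c(10, NA, 30))
toy[!complete.cases(toy), ]  # returns the one row with a missing value
```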
####
#### Do the values (v) have the ranges you expect?
####
d_ply(datl, ~ k, function(df) {
cat("\n----------\n")
cat("key value: ", df$k %>% unique(), "\n")
summary(df) %>% print()
})
##
## ----------
## key value: lg.orders
## dt k v
## Min. :2013-03-31 Length:966 Min. : 16922
## 1st Qu.:2013-11-27 Class :character 1st Qu.: 23873
## Median :2014-07-26 Mode :character Median : 27746
## Mean :2014-07-26 Mean : 35492
## 3rd Qu.:2015-03-24 3rd Qu.: 32629
## Max. :2015-11-21 Max. :612428
##
## ----------
## key value: lg.revenue
## dt k v
## Min. :2013-03-31 Length:966 Min. : 3432037
## 1st Qu.:2013-11-27 Class :character 1st Qu.: 5466069
## Median :2014-07-26 Mode :character Median : 6494446
## Mean :2014-07-26 Mean : 8152359
## 3rd Qu.:2015-03-24 3rd Qu.: 7495769
## Max. :2015-11-21 Max. :161603019
##
## ----------
## key value: lg.visits
## dt k v
## Min. :2013-03-31 Length:966 Min. : 1380483
## 1st Qu.:2013-11-27 Class :character 1st Qu.: 1828408
## Median :2014-07-26 Mode :character Median : 1961467
## Mean :2014-07-26 Mean : 2210344
## 3rd Qu.:2015-03-24 3rd Qu.: 2168118
## Max. :2015-11-21 Max. :11894668
##
## ----------
## key value: sm.orders
## dt k v
## Min. :2013-03-31 Length:966 Min. : 170
## 1st Qu.:2013-11-27 Class :character 1st Qu.: 2906
## Median :2014-07-26 Mode :character Median : 4024
## Mean :2014-07-26 Mean : 5760
## 3rd Qu.:2015-03-24 3rd Qu.: 5718
## Max. :2015-11-21 Max. :190944
##
## ----------
## key value: sm.revenue
## dt k v
## Min. :2013-03-31 Length:966 Min. : 28055
## 1st Qu.:2013-11-27 Class :character 1st Qu.: 506638
## Median :2014-07-26 Mode :character Median : 728351
## Mean :2014-07-26 Mean : 1037270
## 3rd Qu.:2015-03-24 3rd Qu.: 1018536
## Max. :2015-11-21 Max. :44266671
##
## ----------
## key value: sm.visits
## dt k v
## Min. :2013-03-31 Length:966 Min. : 489582
## 1st Qu.:2013-11-27 Class :character 1st Qu.: 816155
## Median :2014-07-26 Mode :character Median :1047774
## Mean :2014-07-26 Mean :1181721
## 3rd Qu.:2015-03-24 3rd Qu.:1331734
## Max. :2015-11-21 Max. :8652817
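The d_ply() split-apply above can also be written with base R's by(); a toy frame (hypothetical values) shows the shape of the result:

```r
# One summary() per key value, using base R's by().
toy <- data.frame(
  k = rep(c("lg.visits", "sm.visits"), each = 3),
  v = c(10, 20, 30, 1, 2, 3)
)
grp <- by(toy$v, toy$k, summary)
grp[["lg.visits"]]  # one summary per key
```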
####
#### Plot the data
####
### Plotting helps you find anomalies that only show up visually.
p <- ggplot(datl, aes(dt, v)) +
facet_grid( k ~ ., scale="free_y" ) +
geom_line() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("Metric") +
ggtitle("Metric vs. Date, facet Metric")
print(p)
We will make a forecast of sm.visits. The question is what the best way to model this is. The approach we adopt here: the most stable time series to forecast is the combined visits (all.visits = sm.visits + lg.visits), so we forecast that, forecast the ratio sm.visits/all.visits, and then multiply the two forecasts together.
In order to do this, we simplify our dataset by keeping only the visits data, and by calculating the all.visits column.
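The final combination step is just elementwise multiplication of the two forecasts; a two-day toy example with made-up numbers:

```r
# Combining the two forecasts is elementwise multiplication.
allv_fcst   <- c(2.0e6, 2.1e6)   # hypothetical all.visits forecast
sm2all_fcst <- c(0.30, 0.31)     # hypothetical ratio forecast
smv_fcst <- allv_fcst * sm2all_fcst
smv_fcst
```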
#### Create a dataset we want.
datv <- datl %>%
filter(str_detect(k, "visits")) %>% # Keep only metrics pertaining to visits
spread(k, v) %>% # Convert from long to wide format of data, so we have 1 dt value per row, and the metrics as separate columns
mutate(all.visits = sm.visits + lg.visits) %>% # Easy compute of all.visits
mutate(sm2all = sm.visits / all.visits) # Get small to all visits ratio
#### Check the data form out
datv %>% print()
## Source: local data frame [966 x 5]
##
## dt lg.visits sm.visits all.visits sm2all
## 1 2013-03-31 1725827 1042087 2767914 0.3764882
## 2 2013-04-01 1930250 727744 2657994 0.2737944
## 3 2013-04-02 1724775 649842 2374617 0.2736618
## 4 2013-04-03 1665799 618400 2284199 0.2707295
## 5 2013-04-04 1730834 644915 2375749 0.2714575
## 6 2013-04-05 1894947 697160 2592107 0.2689549
## 7 2013-04-06 1553546 748533 2302079 0.3251552
## 8 2013-04-07 1690491 840374 2530865 0.3320501
## 9 2013-04-08 1675313 619243 2294556 0.2698749
## 10 2013-04-09 1572335 587690 2160025 0.2720756
## .. ... ... ... ... ...
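For reference, the same long-to-wide pivot plus derived columns can be done with base R's reshape(), shown here on a toy two-day frame with hypothetical values:

```r
# Long-to-wide plus the derived all.visits and sm2all columns, in base R.
long <- data.frame(
  dt = rep(as.Date("2013-03-31") + 0:1, times = 2),
  k  = rep(c("lg.visits", "sm.visits"), each = 2),
  v  = c(100, 110, 50, 40)
)
wide <- reshape(long, idvar = "dt", timevar = "k", direction = "wide")
names(wide) <- sub("^v\\.", "", names(wide))   # v.lg.visits -> lg.visits
wide$all.visits <- wide$lg.visits + wide$sm.visits
wide$sm2all <- wide$sm.visits / wide$all.visits
wide
```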
#### Look at the data we just generated
datt <- datv %>% select(dt, all.visits, sm2all) %>% gather(k, v, -dt)
p <- ggplot(datt, aes(dt, v)) +
facet_grid( k ~ ., scale="free_y" ) +
geom_line() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("Metric") +
ggtitle("Metric vs. Date, facet Metric")
print(p)
all.visits looks reasonable.
sm2all appears to have some anomalies around 2015-08-01 and 2015-09-15. After talking with Rachel Wright, those are dates when the way small-screen visits were reported changed, and that change is reflected in the data. The effect is hard to see in the raw data, but is somewhat easier to see in the ratio sm2all.
These level jumps can make forecasting hard. The assumption is that if the latest data adjustment is the correct data, the previous data is inflated. How do we deal with this when making forecasts, if our forecasts depend on historic data? We’ll discuss further below.
One of the issues with this problem now is that ranges of the data are inflated. How do we deal with this inflated data when forecasting all.visits?
There are a few options for handling this, but we will try some examples and defer the decision to the end.
This is a naive model that will fail, but we fit it for illustrative purposes.
#### Fit the model
allvisits_lmmod_v_dt <- lm(all.visits ~ dt, data=datv)
summary(allvisits_lmmod_v_dt)
##
## Call:
## lm(formula = all.visits ~ dt, data = datv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1081845 -639263 -405678 -86184 16559830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9085306.0 3005075.0 -3.023 0.00257 **
## dt 766.5 184.6 4.153 3.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1600000 on 964 degrees of freedom
## Multiple R-squared: 0.01757, Adjusted R-squared: 0.01656
## F-statistic: 17.24 on 1 and 964 DF, p-value: 3.577e-05
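The 95% prediction interval that forecast() produces for an lm can also be obtained from base R's predict(); a sketch on synthetic data (the names toy/fit here are hypothetical):

```r
# Prediction interval for a simple linear trend, via base R's predict().
set.seed(1)
toy <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
fit <- lm(y ~ x, data = toy)
pi <- predict(fit, newdata = data.frame(x = 21),
              interval = "prediction", level = 0.95)
pi  # columns: fit, lwr, upr
```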
#### Inspect the outcome fitted to real data
### Bind forecast to original fitted data
dat_allvisits_fcst <- cbind(datv, forecast(allvisits_lmmod_v_dt, datv) %>% as.data.frame())
### Plot it, vs. time
p <- ggplot(dat_allvisits_fcst, aes(dt)) +
geom_ribbon(aes(ymin=`Lo 95`, ymax=`Hi 95`), alpha=0.3, fill="orange") +
geom_point(aes(y=all.visits, color="Observed")) +
geom_line(aes(y=`Point Forecast`, color="Fitted")) +
scale_colour_manual("",
values = c("Observed"="black"
,"Fitted"="red")) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("Visits") +
ggtitle("all.visits, Observed and Fitted vs. Date\nall.visits ~ dt + 1\nForecast\n95% prediction intervals")
print(p)
We know we can do better than this just by trying to predict the current year from the previous year. This year-over-year (YoY) method, though, will suffer at the points where the level jumps occur. We will do it anyway and show what happens.
As stated, it may make more sense to predict this year’s all.visits from last year’s value. To do that, we need a data set with the dates shifted; below, we make the shifted data set and fit a linear model on it.
This looks a lot more realistic, though you can still see some issues.
The question is: Is this good enough?
For purposes of being complete, it makes sense to look at the predicted values vs. the regressor values, taking out the time dependence. That graph is here:
#### As with the linear model of last year's visits, we want the frequency here to be 364
#### in order to align the days of the week.
ts_train <- ts(datv, frequency=364)
#### Create model
allvisits_hwmod <- HoltWinters(ts_train[,4], seasonal = "multiplicative")
# allvisits_hwmod %>% print()
#### Put together fitted and observed
dat_hw_fitobs <- cbind(ts_train, fitted(allvisits_hwmod)) %>%
as.data.frame() %>%
select(dt=ts_train.dt, all.visits = ts_train.all.visits, fitted = `fitted(allvisits_hwmod).xhat`) %>%
mutate(dt = as.Date(dt)) %>%
tbl_df()
### Plot it, observed and fitted vs. date
p <- ggplot(dat_hw_fitobs, aes(dt)) +
geom_point(aes(y=all.visits, color="Observed")) +
geom_line(aes(y=fitted, color="Fitted")) +
scale_colour_manual("",
values = c("Observed"="black"
,"Fitted"="red")) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("all.visits") +
ggtitle("all.visits, Observed and Fitted vs. Date\nHolt-Winters\nForecast")
print(p)
This does better than the linear model, just by visual inspection. Note that:
Holt-Winters gains its advantage by updating based on previous observations in time, whereas the linear model fits everything at once.
This means that Holt-Winters can respond to changes in level in ways that a straight “linear regression on history” cannot.
So, is Holt-Winters better here? I would say probably, based on the assumption that future predictions will look like past ones, but have to have their level adjusted. This is up for debate however with those with business domain knowledge.
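To make that behavior concrete, here is a minimal Holt-Winters fit on a synthetic weekly-seasonal series (base R stats only; additive seasonality rather than the multiplicative form used above, purely for the toy):

```r
# Holt-Winters on 20 weeks of fake daily data with a weekly cycle.
set.seed(42)
y <- 100 + 10 * sin(2 * pi * (1:140) / 7) + rnorm(140)
hw <- HoltWinters(ts(y, frequency = 7), seasonal = "additive")
p <- predict(hw, n.ahead = 14, prediction.interval = TRUE)
head(p)  # columns: fit, upr, lwr for the next two weeks
```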
Let’s forecast all.visits based on the Holt-Winters model.
#### 90-day forecast
allvisits_fcst <- forecast(allvisits_hwmod, h=90)
#### Assemble the forecast
#### Sadly, this is currently a bit manual.
hwfcst_mean <- allvisits_fcst$mean %>% as.vector()
dat_allv_fcst <- data.frame(dt=seq.Date(from=max(ts_train[,"dt"] %>% as.vector() %>% {. + 1} %>% as.Date()), by=1, length.out = length(hwfcst_mean))
, fcst = hwfcst_mean
, lo95 = allvisits_fcst$lower[,2] %>% as.vector()
, hi95 = allvisits_fcst$upper[,2] %>% as.vector()) %>%
tbl_df()
# dat_allv_fcst
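merge(..., all = TRUE) is base R's full outer join (equivalent to dplyr::full_join); a toy example with hypothetical dates and values:

```r
# Full outer join: union of dates, NA where either side is missing.
a <- data.frame(dt = as.Date("2015-11-20") + 0:2, fcst = 1:3)
b <- data.frame(dt = as.Date("2015-11-19") + 0:2, fitted = 9:7)
merge(a, b, by = "dt", all = TRUE)
```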
#### Put observed, fitted, and forecast all into one dataframe
dat_allv_fcst <- merge(dat_allv_fcst, dat_hw_fitobs, by = "dt", all = TRUE)
#### Plot observed, fitted, and forecast
p <- ggplot(dat_allv_fcst, aes(dt)) +
geom_ribbon(aes(ymin=lo95, ymax=hi95), alpha=0.3, fill="orange") +
geom_point(aes(y=all.visits, color="Observed")) +
geom_line(aes(y=fitted, color="Fitted")) +
geom_line(aes(y=fcst, color="Forecast")) +
scale_colour_manual("",
values = c("Observed"="black"
,"Fitted"="red"
,"Forecast"="blue")) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("all.visits") +
ggtitle("all.visits, Observed, Fitted, and Forecast vs. Date\nHolt-Winters\nForecast")
print(p)
To recap our strategy, we will forecast all.visits and sm2all, and then multiply the forecasts to get a forecast for sm.visits. So far, we have all.visits. Next we forecast sm2all. For brevity, we will just consider the Holt-Winters models, and forecast off of that.
#### As with the linear model of last year's visits, we want the frequency here to be 364
#### in order to align the days of the week.
ts_train <- ts(datv, frequency=364)
#### Create model
sm2all_hwmod <- HoltWinters(ts_train[,5], seasonal = "multiplicative")
#### Assemble data frame of observed, fitted, and forecast
dat_hw_sm2all_fitobs <- cbind(ts_train, fitted(sm2all_hwmod)) %>%
as.data.frame() %>%
select(dt=ts_train.dt, sm2all = ts_train.sm2all, fitted = `fitted(sm2all_hwmod).xhat`) %>%
mutate(dt = as.Date(dt)) %>%
tbl_df()
#### 90-day forecast
sm2all_fcst <- forecast(sm2all_hwmod, h=90)
hwfcst_sm2all_mean <- sm2all_fcst$mean %>% as.vector()
dat_sm2all_fcst <- data.frame(
dt=seq.Date(from=max(ts_train[,"dt"] %>% as.vector() %>% {. + 1} %>% as.Date()), by=1, length.out = length(hwfcst_sm2all_mean))
, fcst = hwfcst_sm2all_mean
, lo95 = sm2all_fcst$lower[,2] %>% as.vector()
, hi95 = sm2all_fcst$upper[,2] %>% as.vector()) %>%
tbl_df()
#### Put observed, fitted, and forecast all into one dataframe
dat_sm2all_fcst <- merge(dat_sm2all_fcst, dat_hw_sm2all_fitobs, by = "dt", all = TRUE)
#### Plot observed, fitted, and forecast
p <- ggplot(dat_sm2all_fcst %>% filter(!is.na(fcst) | !is.na(fitted)), aes(dt)) +
# geom_point(aes(y=sm2all, color="Observed")) +
geom_line(aes(y=fitted, color="Fitted")) +
geom_line(aes(y=fcst, color="Forecast")) +
scale_colour_manual("",
values = c("Observed"="black"
,"Fitted"="red"
,"Forecast"="blue")) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("sm2all") +
ggtitle("sm2all ratio, Observed, Fitted, and Forecast vs. Date\nHolt-Winters\nForecast")
print(p)
If carrying forward the drop from a year ago doesn’t make sense, an alternative is to look at trends over the last 7 days instead. This model still learns its parameters over the whole training period, but forecasts based on a weekly seasonal pattern, weighted more heavily toward the last 7 days of data.
#### Make a timeseries, set repeat frequency
ts_train <- ts(datv, frequency=7)
#### Create model
sm2all_hwmod <- HoltWinters(ts_train[,5], seasonal = "multiplicative")
#### Assemble data frame of observed, fitted, and forecast
dat_hw_sm2all_fitobs <- cbind(ts_train, fitted(sm2all_hwmod)) %>%
as.data.frame() %>%
select(dt=ts_train.dt, sm2all = ts_train.sm2all, fitted = `fitted(sm2all_hwmod).xhat`) %>%
mutate(dt = as.Date(dt)) %>%
tbl_df()
#### 90-day forecast
sm2all_fcst <- forecast(sm2all_hwmod, h=90)
hwfcst_sm2all_mean <- sm2all_fcst$mean %>% as.vector()
dat_sm2all_fcst <- data.frame(
dt=seq.Date(from=max(ts_train[,"dt"] %>% as.vector() %>% {. + 1} %>% as.Date()), by=1, length.out = length(hwfcst_sm2all_mean))
, fcst = hwfcst_sm2all_mean
, lo95 = sm2all_fcst$lower[,2] %>% as.vector()
, hi95 = sm2all_fcst$upper[,2] %>% as.vector()) %>%
tbl_df()
#### Put observed, fitted, and forecast all into one dataframe
dat_sm2all_fcst <- merge(dat_sm2all_fcst, dat_hw_sm2all_fitobs, by = "dt", all = TRUE)
#### Plot observed, fitted, and forecast
p <- ggplot(dat_sm2all_fcst %>% filter(!is.na(fcst) | !is.na(fitted)), aes(dt)) +
# geom_point(aes(y=sm2all, color="Observed")) +
geom_line(aes(y=fitted, color="Fitted")) +
geom_line(aes(y=fcst, color="Forecast")) +
scale_colour_manual("",
values = c("Observed"="black"
,"Fitted"="red"
,"Forecast"="blue")) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("sm2all") +
ggtitle("sm2all ratio, Observed, Fitted, and Forecast vs. Date\nHolt-Winters\nForecast")
print(p)
#### Join the all.visits and sm2all forecasts
dat_finall <- full_join(
dat_allv_fcst %>% select(dt, all.visits, fcst_allv = fcst, fitted_allv = fitted)
, dat_sm2all_fcst %>% select(dt, fcst_sm2all = fcst, fitted_sm2all = fitted)
, "dt") %>%
full_join(datv %>% select(dt, sm.visits), "dt") %>%
group_by(dt) %>%
mutate(sm.visits_calc = Reduce( # This picks either fitted or forecasted computation, whichever is not NA
function(x, y) if(!is.na(x)) x else y
, c(fitted_allv * fitted_sm2all, fcst_allv * fcst_sm2all))) %>%
ungroup()
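The Reduce() call above is effectively a row-wise coalesce: take the fitted product where it exists, otherwise the forecast product. A vectorized sketch of the same idea (coalesce2 is a hypothetical helper name):

```r
# First-non-NA between two vectors, elementwise.
coalesce2 <- function(x, y) ifelse(!is.na(x), x, y)
fitted_v <- c(10, NA, NA)   # hypothetical fitted values
fcst_v   <- c(NA, 20, NA)   # hypothetical forecast values
coalesce2(fitted_v, fcst_v)
```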
#### Plot observed, fitted, and forecast
p <- ggplot(dat_finall, aes(dt)) +
geom_point(aes(y=sm.visits, color="Observed")) +
geom_line(aes(y=sm.visits_calc, color="Fitted")) +
scale_colour_manual("",
values = c("Observed"="black"
,"Fitted"="red")) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y = element_text(angle = 0)) +
scale_x_date(breaks=date_breaks("month")) +
xlab("Date") +
scale_y_continuous(labels=comma) +
ylab("sm.visits") +
ggtitle("Predicted sm.visits, Observed, Fitted, and Forecast vs. Date\nHolt-Winters\nForecast")
print(p)
Note that I do not have error bounds here. I haven’t sorted out how to make them from data combined this way, and I’m not sure I need to for this purpose.